DS 6030 | Spring 2023 | University of Virginia


# set working directory
setwd("/Users/matthewscheffel/Desktop/MSDS/DS 6030/Disaster Relief Project")
getwd()
#> [1] "/Users/matthewscheffel/Desktop/MSDS/DS 6030/Disaster Relief Project"
# load in the necessary packages
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#> ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
#> ✔ tibble  3.1.8     ✔ dplyr   1.1.0
#> ✔ tidyr   1.3.0     ✔ stringr 1.5.0
#> ✔ readr   2.1.2     ✔ forcats 0.5.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(caret)
#> Loading required package: lattice
#> 
#> Attaching package: 'caret'
#> 
#> The following object is masked from 'package:purrr':
#> 
#>     lift
library(dplyr)
library(glmnet)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> 
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> 
#> Loaded glmnet 4.1-6
library(ggplot2)
library(GGally)
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
library(ROCR)
library(plotly)
#> 
#> Attaching package: 'plotly'
#> 
#> The following object is masked from 'package:ggplot2':
#> 
#>     last_plot
#> 
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> 
#> The following object is masked from 'package:graphics':
#> 
#>     layout
library(ggcorrplot)
library(ggpubr)
library(cowplot)
#> 
#> Attaching package: 'cowplot'
#> 
#> The following object is masked from 'package:ggpubr':
#> 
#>     get_legend
library(parallel)
library(kableExtra)
#> 
#> Attaching package: 'kableExtra'
#> 
#> The following object is masked from 'package:dplyr':
#> 
#>     group_rows
library(tidyr)
library(pROC)
#> Type 'citation("pROC")' for a citation.
#> 
#> Attaching package: 'pROC'
#> 
#> The following objects are masked from 'package:stats':
#> 
#>     cov, smooth, var

Data Wrangling and EDA

To begin this project, I will load in the data and begin an Exploratory Data Analysis of the dataset. I will examine the underlying statistics of the data and create a number of data visualizations.

# import data
data <- read.csv("HaitiPixels.csv", sep=",", header=TRUE)

# head of the data
head(data)
#>        Class Red Green Blue
#> 1 Vegetation  64    67   50
#> 2 Vegetation  64    67   50
#> 3 Vegetation  64    66   49
#> 4 Vegetation  75    82   53
#> 5 Vegetation  74    82   54
#> 6 Vegetation  72    76   52
# structure of the data frame
str(data)
#> 'data.frame':    63241 obs. of  4 variables:
#>  $ Class: chr  "Vegetation" "Vegetation" "Vegetation" "Vegetation" ...
#>  $ Red  : int  64 64 64 75 74 72 71 69 68 67 ...
#>  $ Green: int  67 67 66 82 82 76 72 70 70 70 ...
#>  $ Blue : int  50 50 49 53 54 52 51 49 49 50 ...
# summary statistics
summary(data)
#>     Class                Red          Green            Blue      
#>  Length:63241       Min.   : 48   Min.   : 48.0   Min.   : 44.0  
#>  Class :character   1st Qu.: 80   1st Qu.: 78.0   1st Qu.: 63.0  
#>  Mode  :character   Median :163   Median :148.0   Median :123.0  
#>                     Mean   :163   Mean   :153.7   Mean   :125.1  
#>                     3rd Qu.:255   3rd Qu.:226.0   3rd Qu.:181.0  
#>                     Max.   :255   Max.   :255.0   Max.   :255.0

The data set includes the following variables:

“Class” is a categorical variable with five categories describing the type of land (vegetation, soil, rooftop, non-tarp, and blue-tarp) contained within the images. “Red”, “Green”, and “Blue” are numerical variables representing the intensity of each color in the pixels of the image for each land category.

data$Blue_Tarp <- ifelse(data$Class == "Blue Tarp", "Yes", "No")
data$Blue_Tarp <- factor(data$Blue_Tarp, levels = c("No", "Yes"))

After loading and examining the data, I created a new variable called “Blue_Tarp” that checks whether the “Class” column has values that are equal to the “Blue Tarp” variable. If they are equal, the Blue_Tarp variable is set to “Yes” (and set to “No” if they are not equal.) I then converted the new variable (Blue_Tarp) to a factor with levels of “No” and “Yes”. Adding this binary value makes sense since we are only interested in predicting if a tarp pixel is blue or not (as opposed to differentiating between multiple colors.)

Now, I will create and examine a number of data visualizations.

Histograms of Pixel Values by Tarp Status

data %>%
  select(Blue, Green, Red, Blue_Tarp) %>%
  gather(key = "Color", value = "Value", Blue:Red) %>%
  ggplot(aes(x = Value, fill = Blue_Tarp)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 50) +
  scale_fill_manual(values = c("#C0C0C0", "#0000FF"), name = "Blue Tarp") +
  facet_wrap(~Color, scales = "free_x", nrow = 1) +
  labs(title = "Histogram of Pixel Values by Color", x = NULL, y = "Count")

Correlation Visualization of Color Values

# select the color columns
color_cols <- c("Red", "Green", "Blue")

# correlation matrix
cor_mat <- cor(data[, color_cols])

# heatmap
ggcorrplot(cor_mat, 
           type = "upper", 
           lab = TRUE, 
           lab_size = 3, 
           method = "circle",
           colors = c("#6D9EC1", "#FFFFFF", "#E46726"),
           title = "Correlation of Color Values")

Desnity Plots of Pixel Values

# Vegetation
vegetation <- data %>%
  filter(Class == "Vegetation") %>%
  ggplot(aes(x = Green)) +
  geom_density(color = 'green') +
  geom_density(aes(x = Red), color = 'red') +
  geom_density(aes(x = Blue), color = 'blue') + 
  labs(x = "Pixel Value", y = "Density", title = "Vegetation") +
  scale_x_continuous(limits = c(0, 255))

# Soil
soil <- data %>%
  filter(Class == "Soil") %>%
  ggplot(aes(x = Green)) +
  geom_density(color = 'brown') +
  geom_density(aes(x = Red), color = 'red') +
  geom_density(aes(x = Blue), color = 'blue') + 
  labs(x = "Pixel Value", y = "Density", title = "Soil") +
  scale_x_continuous(limits = c(0, 255))

# Rooftop
rooftop <- data %>%
  filter(Class == "Rooftop") %>%
  ggplot(aes(x = Green)) +
  geom_density(color = 'gray') +
  geom_density(aes(x = Red), color = 'red') +
  geom_density(aes(x = Blue), color = 'blue') + 
  labs(x = "Pixel Value", y = "Density", title = "Rooftop") +
  scale_x_continuous(limits = c(0, 255))

# Blue Tarp
bluetarp <- data %>%
  filter(Class == "Blue Tarp") %>%
  ggplot(aes(x = Green)) +
  geom_density(color = 'blue') +
  geom_density(aes(x = Red), color = 'red') +
  geom_density(aes(x = Blue), color = 'lightblue') + 
  labs(x = "Pixel Value", y = "Density", title = "Blue Tarp") +
  scale_x_continuous(limits = c(0, 255))

# Various Non-Tarp
non_tarp <- data %>%
  filter(Class == "Various Non-Tarp") %>%
  ggplot(aes(x = Green)) +
  geom_density(color = 'orange') +
  geom_density(aes(x = Red), color = 'red') +
  geom_density(aes(x = Blue), color = 'blue') + 
  labs(x = "Pixel Value", y = "Density", title = "Various Non-Tarp") +
  scale_x_continuous(limits = c(0, 255))

# arrange plots in a grid
plot_grid(vegetation, soil, rooftop, bluetarp, non_tarp, ncol = 3)

Matrix of Scatter Plots, Histograms, Correlations, Box Plots for Color Variables

ggpairs(data[2:5],aes(color = Blue_Tarp, alpha = 0.5))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3D Scatter Plot of Pixel Values

# 3D scatter plot of the pixel values, with colors and size mapped to variables
# source: https://plotly.com/r/3d-scatter-plots/

plot <- plot_ly(data, x = ~Red, y = ~Green, z = ~Blue, color = ~Class)
                #colors = c('blue', 'red', 'brown', 'yellow', 'green'), size = 1)

# add markers
plot <- plot %>% add_markers()
plot

The data visualizations from the Exploratory Data Analysis have revealed a number of useful observations. Blue and Red pixels appear to have the lowest level of correlation. Based on a number of the graphs above, the pixels do not appear to be categorized as “Blue” until they reach a value of approximately 100 and above. However, from the data summary, we see that Blue has the lowest minimum, lowest mean, and lowest median values. This suggests that the blue tarps may cause classification/identification issues due to their ability to be easily mistaken for darker images such as water, shade, etc.

Based on the visualizations, there appears to be low correlation between Blue Tarp pixel values and higher Red/Green pixel values. This may be due to Red and Green’s closer relationship to each other on the color scale than to Blue, which means it’s less likely for a pixel to have both a high Red or Green value and a high Blue value. Ultimately, it’s less likely to find a Blue Tarp pixel with high Red or Green values.

Now, I will begin fitting and testing the data on a number of models.

Model Fitting, Tuning Parameter Selection, and Evaluation

To ensure consistency in the cross-validation process, a seed of 123 was set before I created 10 folds from the data using the createFolds() function. The folds were created specifically for the Blue_Tarp variable and returned as training sets. The trainControl() function was then used to set up the same 10-fold cross-validation procedure for all models, with the folds specified as the index and predictions saved along with class probabilities.

set.seed(123)

# 10 folds for cross-validation
folds <- createFolds(data$Blue_Tarp, k = 10, list = TRUE, returnTrain = TRUE)

# use trainControl from caret package
# source: https://www.rdocumentation.org/packages/caret/versions/6.0-92/topics/trainControl

# trainControl object
control <- trainControl(method = "cv",
                        number = 10,
                        index = folds,
                        savePredictions = TRUE,
                        classProbs = TRUE)

Next, I created some functions to allow for easier analysis of the models using the same standard tests and statistics that are required in this project.

This function defines the thresholds to test and the statistics of interest, and defines a function “Test_Thresholds” which takes a statistical model as the input and outputs the selected statistics for different thresholds. This also computes the false negative/false positive rates and returns the results.

# source: https://www.rdocumentation.org/packages/caret/versions/6.0-92/topics/thresholder

set.seed(123)

# thresholds for testing
thresholds <- seq(0.1, 0.9, by = 0.1)

# statistics of interest for the model
stats_threshold <- c("Accuracy", "Kappa", "Sensitivity", "Specificity", "Precision")

# function to compute statistics for different thresholds
Test_Thresholds <- function(model) {
  library(caret)
  
  # statistics for different thresholds
  # thresholder function from caret package
  results <- thresholder(model, 
                         threshold = thresholds, 
                         statistics = stats_threshold)
  
  # false negative and false positive rates
  results$falseNeg <- 1 - results$Sensitivity
  results$falsePos <- 1 - results$Specificity
  
  # Return the results
  return(results)
}

The “ROC_Plot” function plots the ROC curve and calculates the AUC for a given (binary) classification model. It takes the model, selected performance metrics, and the model name as input. The function uses the prediction probabilities from the model to create a ROC curve, which is plotted and displayed with the AUC value. The AUC value is added to the model_stats data frame, which is returned at the end of the function.

# source: https://www.rdocumentation.org/packages/pROC/versions/1.18.0/topics/roc

ROC_Plot <- function(model, model_stats, model_name, seed = 123) {
  set.seed(seed)
  
  prob <- model$pred[order(model$pred$rowIndex),]
  
  rates <- prediction(prob$Yes,as.numeric(data$Blue_Tarp))
  roc <- performance(rates, measure = "tpr", x.measure = "fpr")
  plot(roc, main = paste("ROC Curve:", model_name))
  lines(x = c(0,1), y = c(0,1), col = "red")
  
  auc <- performance(rates, "auc")
  model_stats$AUROC <- auc@y.values[[1]]
  return(model_stats)
}

Now, I will run each model on the data set as well as use the aforementioned functions to determine the appropriate thresholds and ROC/AUC values. The models will follow this basic formula:

\[BlueTarp = RedX_1 + GreenX_2 + BlueX_3\]

Logistic Regression

This model uses the train() function of the caret package to create logistic regression models with binomial family and GLM method. The model is cross-validated using trainControl() to calculate accuracy and kappa statistics.

set.seed(123)

GLM_Reg<-train(Blue_Tarp~Red+Green+Blue,
      data = data,
      family = "binomial",
      method = "glm",
      trControl = control)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
GLM_Reg
#> Generalized Linear Model 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56916, 56917, 56916, 56918, 56917, 56917, ... 
#> Resampling results:
#> 
#>   Accuracy  Kappa    
#>   0.995272  0.9203271

The threshold test is used to compare the performance of the model across multiple threshold values.

GLM_Reg_T <- Test_Thresholds(GLM_Reg)
GLM_Reg_T[2:9] %>% slice_max(Accuracy)
#>   prob_threshold  Accuracy     Kappa Sensitivity Specificity Precision
#> 1            0.7 0.9956199 0.9277236   0.9984318   0.9104741 0.9970478
#>      falseNeg   falsePos
#> 1 0.001568152 0.08952592

From the threshold test, we can see that at a probability threshold of 0.7, the accuracy was 0.996, kappa was 0.928, sensitivity was 0.998, specificity was 0.910, and precision was 0.997. In this case, approximately 9% of the non-blue tarp values were classified as a blue tarp by the model. On the other hand, only approximately 0.16% of the blue tarp values were classified as non-blue tarp by this model. This tells us that the model has a high sensitivity but a relatively lower specificity, which means that it is adept at identifying blue tarps but may also mis-classify some non-blue tarps as blue. Overall, these stats suggest that the model performs well at this threshold value.

GLM_Final <- GLM_Reg_T[2:9] %>% slice_max(Accuracy)
GLM_Final <- ROC_Plot(GLM_Reg, GLM_Final, "Logistic Regression")

The Logistic Regression model has an AUROC value of 0.9985.

Linear Discriminant Analysis

This model uses the train() function of the caret package to create an LDA method model. The model is cross-validated using trainControl() to calculate accuracy and kappa statistics.

set.seed(123)

LDA_Data <- train(Blue_Tarp ~ Red + Green + Blue, data = data,
                  method = "lda",
                  trControl = control)

LDA_Data
#> Linear Discriminant Analysis 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56916, 56917, 56916, 56918, 56917, 56917, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa    
#>   0.9839344  0.7531086

The threshold test is used to compare the performance of the model across multiple threshold values.

LDA_Data_T <- Test_Thresholds(LDA_Data)
LDA_Data_T[2:9] %>% slice_max(Accuracy)
#>   prob_threshold  Accuracy     Kappa Sensitivity Specificity Precision
#> 1            0.1 0.9846144 0.7471329   0.9926493   0.7413403 0.9914678
#>      falseNeg  falsePos
#> 1 0.007350659 0.2586597

From the threshold test, we can see that at a probability threshold of 0.1, the model has an accuracy of 0.985 and a sensitivity of 0.993, which means that the model correctly identified most of the positive cases. However, the specificity is relatively low at 0.741, indicating that the model has a higher false positive rate, meaning it incorrectly identified some negative cases as positive. This trade-off between sensitivity and specificity can be adjusted by changing the probability threshold value.

Since the threshold may need to be adjusted, I will test for a superior value.

# maximize accuracy
accuracy <- "Accuracy"

# threshold that maximizes accuracy
optimal_threshold <- LDA_Data_T %>% 
  filter_at(vars(contains(accuracy)), all_vars(!is.na(.))) %>%  # missing values
  slice(which.max(get(accuracy))) %>%  # get the row with the highest performance metric value
  pull(prob_threshold)  # optimal threshold

# print optimal threshold
cat("Optimal threshold value based on Accuracy:", optimal_threshold, "\n")
#> Optimal threshold value based on Accuracy: 0.1
# maximize precision
precision <- "Precision"

# threshold that maximizes precision
optimal_threshold <- LDA_Data_T %>% 
  filter_at(vars(contains(precision)), all_vars(!is.na(.))) %>%  # missing values
  slice(which.max(get(precision))) %>%  # get the row with the highest performance metric value
  pull(prob_threshold)  # optimal threshold

# print optimal threshold
cat("Optimal threshold value based on Precision:", optimal_threshold, "\n")
#> Optimal threshold value based on Precision: 0.9
# maximize specificity
specificity <- "Specificity"

# threshold that maximizes specificity
optimal_threshold <- LDA_Data_T %>% 
  filter_at(vars(contains(specificity)), all_vars(!is.na(.))) %>%  # missing values
  slice(which.max(get(specificity))) %>%  # get the row with the highest performance metric value
  pull(prob_threshold)  # optimal threshold

# print optimal threshold
cat("Optimal threshold value based on Specificity:", optimal_threshold, "\n")
#> Optimal threshold value based on Specificity: 0.9

Running a few tests seems to indicate that a higher threshold would be better. I will adjust my model to use a threshold value of 0.9.

LDA_Data_T$prob_threshold <- 0.9
LDA_Final <- LDA_Data_T[2:9] %>% slice_max(Accuracy)
LDA_Final <- ROC_Plot(LDA_Data, LDA_Final, "LDA")

The AUROC value for the LDA model is 0.9889.

Quadratic Discriminant Analyis

This model uses the train() function of the caret package to create a QDA method model. The model is cross-validated using trainControl() to calculate accuracy and kappa statistics.

set.seed(123)

QDA_Data <- train(Blue_Tarp~Red+Green+Blue, data=data,
                  method="qda",
                  trControl=control)

QDA_Data
#> Quadratic Discriminant Analysis 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56916, 56917, 56916, 56918, 56917, 56917, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa    
#>   0.9945763  0.9052171

The threshold test is used to compare the performance of the model across multiple threshold values.

QDA_Data_T <- Test_Thresholds(QDA_Data)
QDA_Data_T[2:9] %>% slice_max(Accuracy)
#>   prob_threshold  Accuracy     Kappa Sensitivity Specificity Precision
#> 1            0.7 0.9947186 0.9090382   0.9993139   0.8555797 0.9952505
#>       falseNeg  falsePos
#> 1 0.0006860556 0.1444203

From the threshold test, we can infer that at a probability threshold of 0.7, the model has a strong overall accuracy of 0.9947 and a high sensitivity value of 0.9993, indicating that it is effective at identifying true positives. However, it also has a relatively high false positive rate of over 14%.

QDA_Final <- QDA_Data_T[2:9] %>% slice_max(Accuracy)
QDA_Final <- ROC_Plot(QDA_Data, QDA_Final, "QDA")

The QDA model has an AUROC value of 0.9982.

K-Nearest Neighbor

This model uses the train() function of the caret package to create a KNN method model. The model is cross-validated using trainControl() to calculate accuracy and kappa statistics.

First, I attempted to create a plot to help determine the accuracy of each K-value:

# grid of tuning parameters for k in KNN model
#knn_grid <- data.frame(k = seq(5, 50, 10))

# train KNN model
#KNN_Data <- train(Blue_Tarp ~ Red + Green + Blue, data = data,
                  #method = "knn",
                  #tuneGrid = knn_grid,
                  #metric = "Accuracy",
                  #trControl = control)

# results
#plot(KNN_Data)

However, my R was crashing if I tried to test a value above 20.

Thus, after testing and experimenting with multiple k-values, I determined to use k = 20 as a good value for this model.

set.seed(123)

# source: https://rpubs.com/Mentors_Ubiqum/tunegrid_tunelength

KNN_Data <- train(Blue_Tarp~Red+Green+Blue, data=data,
                  tuneGrid = data.frame(k=seq(20,20,1)),
                  method = "knn",
                  metric = "Accuracy",
                  trControl = control)

KNN_Data$results %>% slice_max(Accuracy)
#>    k  Accuracy     Kappa   AccuracySD     KappaSD
#> 1 20 0.9969798 0.9512018 0.0005795998 0.009496795

The threshold test is used to compare the performance of the model across multiple threshold values.

KNN_Data_T <- Test_Thresholds(KNN_Data)
KNN_Data_T %>% slice_max(Accuracy)
#>    k prob_threshold  Accuracy     Kappa Sensitivity Specificity Precision
#> 1 20            0.4 0.9969482 0.9501678   0.9987586   0.9421353 0.9980904
#>      falseNeg  falsePos
#> 1 0.001241446 0.0578647

This model used a probability threshold of 0.4 and achieved a high accuracy of 0.997 as well as a high precision level of 0.998. The model correctly identified almost all true positives but had a relatively lower specificity of 0.942. Both the false positive and false negative rates were very low.

KNN_Final <- KNN_Data_T[1:9] %>% slice_max(Accuracy)
KNN_Final <- ROC_Plot(KNN_Data, KNN_Final, "KNN")

The AUROC value for the KNN model was 0.9995.

Penalized Logistic Regression (Elastic Net Penalty)

The model performs Penalized Logistic Regression in the form of a ridge regression using the “glmnet” function from the caret package. It creates a grid of tuning parameters using “expand.grid”, trains the model with “train”, and then outputs the results. Ridge regression is a form of PLR that includes a penalty term in the objective function to reduce the complexity of the model and its overfitting. The tuneGrid uses a sequence of lambda values in this model.

# source: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/expand.grid

lambdas <- expand.grid(alpha = 0, lambda = seq(0,1, 0.1))

Ridge_Data <- train(Blue_Tarp~Red+Green+Blue, data = data,
                  method = "glmnet",
                  tuneGrid = lambdas,
                  trControl = control)

Ridge_Data$results %>% slice_max(Accuracy)
#>   alpha lambda  Accuracy     Kappa   AccuracySD    KappaSD
#> 1     0      0 0.9778783 0.4623894 0.0009166678 0.03317173

Now, the model is updated to perform elastic net regularization once again using the “glmnet” function from the caret package. This model uses a weighted combination of L1 and L2 penalties by setting alpha to 0.8, and sets lambda to 0 to use the minimum amount of regularization.

Elastic_Data <- train(Blue_Tarp~Red+Green+Blue, data = data,
                      method = "glmnet",
                      tuneGrid = expand.grid(alpha = 0.8, lambda = 0),
                      trControl = control)
       
Elastic_Data$results %>% slice_max(Accuracy)
#>   alpha lambda Accuracy     Kappa   AccuracySD   KappaSD
#> 1   0.8      0 0.995193 0.9187325 0.0008071933 0.0144479

From the 2 outputs, we can infer that the Elastic Data model with alpha = 0.8 and lambda = 0 produces the best results.

The threshold test is used to compare the performance of the model across multiple threshold values.

Elastic_Data_T <- Test_Thresholds(Elastic_Data)
Elastic_Data_T %>% slice_max(Accuracy)
#>   alpha lambda prob_threshold  Accuracy    Kappa Sensitivity Specificity
#> 1   0.8      0            0.7 0.9955566 0.926304   0.9985625   0.9045384
#>   Precision    falseNeg   falsePos
#> 1 0.9968531 0.001437473 0.09546164

The model produces high accuracy (99%) and fairly low false negative and false positive rates.

PLR_Final <- Elastic_Data_T %>% slice_max(Accuracy)
PLR_Final <- ROC_Plot(Elastic_Data, PLR_Final, "Penalized Logistic Regression")

The AUROC value for the PLR model is 0.9985.

Performance Tables

Cross Validation Results
Tuning AUROC Threshold Accuracy TPR FPR Precision
KNN 20 0.9995 0.4 0.9969 0.9988 0.0579 0.9981
QDA NA 0.9982 0.7 0.9947 0.9993 0.1444 0.9953
PLR 0.8 0.9985 0.7 0.9956 0.9986 0.0955 0.9969
LR NA 0.9985 0.7 0.9956 0.9984 0.0895 0.9970
LDA NA 0.9889 0.9 0.9846 0.9926 0.2587 0.9915
Summary of Calculations for Model Metrics
Model_Name Metrics_Calculation
Logistic Regression Average Across 10 Folds
LDA (Linear Discriminant Analysis) Average Across 10 Folds
QDA (Quadratic Discriminant Analysis) Average Across 10 Folds
KNN (K-Nearest Neighbor) Maximum Value Across 10 Folds
PLR (Penalized Logistic Regression) Average Across 10 Folds

Conclusions

  1. Best Model Performance:

The KNN model performed the best in terms of AUROC, Accuracy, TPR, FPR, and Precision. It achieved an AUROC of 0.9995, which is very close to 1 and indicates that it can distinguish between the different classes with high accuracy. It also had the highest Accuracy of 0.9969, TPR of 0.9988, FPR of 0.0579, and Precision of 0.9981, which means that it had a low rate of false positives and false negatives and a high rate of true positives and true negatives. QDA performed the second-best, with an AUROC of 0.9982, Accuracy of 0.9947, TPR of 0.9993, FPR of 0.1444, and Precision of 0.9953. PLR and LR both performed similarly, with an AUROC of 0.9985, Accuracy of 0.9956, TPR of 0.9986, FPR of 0.0955 for PLR and TPR of 0.9984, FPR of 0.0895 for LR, and Precision of 0.9969 for PLR and 0.9970 for LR. LDA performed the worst among the models, with an AUROC of 0.9889, Accuracy of 0.9846, TPR of 0.9926, FPR of 0.2587, and Precision of 0.9915. KNN is a non-parametric method and does not require any assumptions about the underlying distribution of the data. KNN is also easy to implement and does not require a long training time. However, KNN has several weaknesses. It requires a large amount of memory to store the training data and can be slow to make predictions for large data sets. It is also sensitive to the choice of the distance metric and the number of neighbors, and can perform poorly if the training data is imbalanced.

In the context of this experiment, there may not necessarily be a “best model” for rescuing people from a naturals disaster due to the unpredictability of nature and the fairly similar results each model in this experiment produced. However, in general, KNN appears to be the most suitable model for identifying stranded people due to the impressive statistical results the model produced as well as the ability of the model to adapt to different scenarios by considering nearby instances. That feature could be very helpful in predicting which areas need to be prioritized for rescue operations. Additionally, KNN’s high accuracy and low false negative rate in this experiment make it a strong candidate for this task, as we would want to minimize the number of false negatives and positives. In the real life scenario, rescuing someone who does not need rescuing (or misidentifying a non-person as a rescue target) would result in wasted time and effort and ultimately could cost someone their health and/or life.

  1. Data Set Formulation:

The “Pixels.csv” data set is highly suitable for predictive modeling in this experiment because it contains a clear binary outcome variable (“Yes” or “No” for the “Blue Tarp” variable), as well as several quantitative predictors (“Red”, “Green”, and “Blue” pixel values). This style of data is frequently seen in classification problems, where our goal is to predict the class of a new observation based on the predictor variables. In this case, the predictive modeling tools used, such as logistic regression, LDA, QDA, KNN, and penalized logistic regression, are well-suited for classification problems and can be used to predict whether a new set of Red, Green, and Blue values corresponds to a positive or negative Blue Tarp outcome. Additionally, the data has been carefully selected and pre-processed to remove any potiental irrelevant or noisy information. This helps to reduce the complexity of the data and improve the accuracy of the models. Finally, the data set is relatively large (with over 60,000 data points), which is an important factor for training accurate predictive statistical models.

In summary, the data set in this project is well-structured, cleanly labeled, and pre-processed, which makes it easily usable with predictive modeling tools.

  1. Additional Recommendation:

My recommended action to help improve the results of the experiment would be to improve upon the data collection and classification process for the experiment. The data in this experiment was fairly primitive, with basic colors and conditions making up the data set. Gathering more advanced data would help in two ways: it would allow for greater ability to distinguish tarps vs. non-tarps and it would allow rescuers to prioritize certain areas/people over others. Collecting more data with a wider range of conditions would allow us to provide more training examples for the model to learn from and improve its ability to adapt to new, previously unknown data. Even something such as more complex color classification schemes may make a significant difference in combing through the data for colors that match those of the tarps.

More advanced data classifications would also allow rescuers to determine which people are in greater need of rescue. For example, if a person’s tarp is damaged and wet, they may be in greater need than a group who have an intact and dry tarp for shelter. One may also consider including data regarding elevation (for flooding), general population age, medical status, and more in certain areas to prioritize rescue victims.

However, these recommendations ultimately do have their limitations, as gathering larger amounts of complex data may ultimately result in the project having to graduate past more simple predictive models to more complicated machine learning models.